Abstract


Heap allocation with copying garbage collection is believed to have poor memory subsystem
performance. We conducted a study of the memory subsystem performance of heap allocation for
memory subsystems found on many machines. We found that many machines support heap
allocation poorly. However, with the appropriate memory subsystem organization, heap
allocation can have good memory subsystem performance.
1  Introduction


Heap allocation with copying garbage collection is widely believed to have poor memory
subsystem performance [30, 37, 38, 23, 39]. To investigate this, we conducted an extensive
study of the memory subsystem performance of heap-allocation-intensive programs on memory
subsystem organizations typical of many workstations. The programs, compiled with the SML/NJ
compiler [3], do tremendous amounts of heap allocation, allocating one word every 4 to 10
instructions. The programs used a generational copying garbage collector to manage their
heaps. To our surprise, we found that for some configurations corresponding to actual
machines, such as the DECStation 5000/200, the memory subsystem performance was comparable
to that of C and Fortran programs [10]: programs ran only 16% slower than they would have
with an infinitely fast memory. For other configurations, the slowdown was often higher
than 100%.

The  memory  subsystem  features  important  for  achieving  good  performance  with  heap  allocation 
are  subblock  placement  with  a  subblock  size  of  one  word  combined  with  write-allocate  on  write- 
miss,  a  write  buffer  and  page-mode  writes,  and  cache  sizes  of  32K  or  larger.  Heap  allocation 
performs  poorly  on  machines  which  do  not  have  one  or  more  of  these  features;  this  includes  most 
current  workstations. 

Our work differs from previously reported work [30, 37, 38, 23, 39] on memory subsystem
performance of heap allocation in two important ways. First, previous work used overall miss
ratios as the performance metric and neglected the potentially different costs of read and
write misses. Overall miss ratios are misleading indicators of performance; a high overall
miss ratio does not always translate to bad performance. We separate read misses from write
misses. Second, previous work did not model the entire memory subsystem: it concentrated
solely on caches. Memory subsystem features such as write buffers and page-mode writes
interact with the costs of hits and misses in the cache and should be simulated to give a
correct picture of memory subsystem behavior. We simulate the entire memory subsystem.

We did the study by instrumenting programs to produce traces of all memory references. We
fed the references into a memory subsystem simulator which calculated a performance penalty
due to the memory subsystem. We fixed the architecture to be the MIPS R3000 [22] and varied
cache configurations to cover the design space typical of workstations such as DECStations,
SPARCStations, and HP 9000 series 700. All the memory subsystem configurations we studied
had a write buffer and page-mode writes. We studied eight substantial programs.

We varied the following cache parameters: size (8K to 128K), block size (16 or 32 bytes),
write miss policy (write allocate or write no allocate), subblock placement (with and
without), and associativity (one and two way). We simulated only split instruction and data
caches, i.e., no unified caches. We report data only for write-through caches, but the
results extend easily to write-back caches (see Section 5.2).

Section 2 gives background information. Section 3 describes related work. Section 4 describes
the simulation methods used, the benchmarks used, and the metrics used to measure memory
subsystem performance. Section 5 presents the results of the simulation studies and an
analysis of those results. Section 6 concludes.


2  Background 


The following sections describe memory subsystems, copying garbage collection, SML, and the
SML/NJ compiler.

2.1  Memory  subsystems 

This section reviews the organization of memory subsystems. Since terminology for memory
subsystems is not standardized, we use Przybylski's terminology [31].

It is well known that CPUs are getting faster relative to DRAM memory chips; main memory
cannot supply the CPU with instructions and data fast enough. A solution to this problem is
to use a cache, a small fast memory placed between the CPU and main memory that holds a
subset of memory. If the CPU reads a memory location which is in the cache, the value is
returned quickly. Otherwise the CPU must wait for the value to be fetched from main memory.

Caches work by reducing the average memory access time. This is possible since memory
accesses exhibit temporal and spatial locality. Temporal locality means that a memory
location that was referenced recently will probably be referenced again soon and is thus
worth storing in the cache. Spatial locality means that a memory location near one which was
referenced recently will probably be referenced soon. Thus, it is worth moving the
neighboring locations to the cache.

2.1.1  Cache  organization 

This section describes cache organization for a single level of caching. A cache is divided
into blocks, each of which has an associated tag. A cache block represents a block of
memory. Cache blocks are grouped into sets. A memory block may reside in the cache in
exactly one set, but may reside in any block within the set. The tag for a cache block
indicates what memory block it holds. A cache with sets of size n is said to be n-way
associative. If n=1, the cache is called direct-mapped. Some caches have valid bits, to
indicate what sections of a block hold valid data. A subblock is the smallest part of a
cache with which a valid bit is associated. In this paper, subblock placement implies a
subblock size of one word, i.e., valid bits are associated with each word. Moreover, on a
read miss, the whole block is brought into the cache, not just the subblock that missed.
Przybylski [31] notes that this is a good choice.

A  memory  access  for  which  a  block  is  resident  in  the  cache  is  called  a  hit.  Otherwise,  the 
memory  access  is  a  miss. 

A read request for memory location m causes m to be mapped to a set. All the tags and valid
bits (if any) in the set are checked to see if any block contains the memory block for m. If
a cache block contains the memory block for m, the word corresponding to m is selected from
the cache block. A read miss is handled by copying the missing block from the main memory to
the cache.
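The lookup just described can be sketched in a few lines of Python. This is only an illustration: the class name, the default geometry (4 sets, 2 ways, 16-byte blocks), and the LRU replacement within a set are assumptions made for the sketch, not parameters taken from the simulator used in this study. Valid bits and data storage are omitted; only tags are tracked.

```python
class SetAssociativeCache:
    """Tag store for an n-way set-associative cache (hypothetical sketch)."""

    def __init__(self, num_sets=4, ways=2, block_bytes=16):
        self.num_sets = num_sets
        self.ways = ways
        self.block_bytes = block_bytes
        # Each set lists the tags of its resident blocks, least recently used first.
        self.sets = [[] for _ in range(num_sets)]

    def locate(self, addr):
        """Map a byte address to (set index, tag)."""
        block = addr // self.block_bytes
        return block % self.num_sets, block // self.num_sets

    def read(self, addr):
        """Return True on a read hit; on a miss, load the block and return False."""
        index, tag = self.locate(addr)
        tags = self.sets[index]
        if tag in tags:
            tags.remove(tag)      # move the tag to the most recently used position
            tags.append(tag)
            return True
        if len(tags) == self.ways:
            tags.pop(0)           # evict the least recently used block (assumed policy)
        tags.append(tag)          # a read miss copies the whole block into the cache
        return False
```

With the default geometry, addresses 0, 64, and 128 all map to set 0, so reading all three forces an eviction in a two-way cache, while addresses 0 and 4 share a block and the second read hits.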

A write hit is always written to the cache. There are several policies for handling a write
miss, differing in their performance penalties. For each of the policies, the actions taken
on a write miss are:

1.  write  no  allocate: 

•  Do  not  allocate  a  block  in  the  cache 

•  Send  the  write  to  main  memory,  without  putting  the  write  in  the  cache. 

2.  write allocate, no subblock placement:

•  Allocate a block in the cache.

•  Fetch the corresponding memory block from main memory.

•  Write the word to the cache and to memory.

3.  write allocate, subblock placement^1:

•  Allocate  a  block  in  the  cache. 

•  Write  the  word  to  the  cache  and  to  memory. 

•  Invalidate  the  remaining  words  in  the  block. 

Write allocate/subblock placement will have a lower write miss penalty than write allocate/no
subblock placement since it avoids fetching a memory block from main memory. In addition, it
will have a lower penalty than write no allocate if the written word is read before being
evicted from the cache. See Jouppi [21] for more information on write miss policies.
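To make the difference between the three policies concrete, the following Python sketch counts how many block fetches from main memory are triggered by write misses. It is a simplified model under stated assumptions: a direct-mapped cache with 8 blocks of 16 bytes, no reads, and policy names chosen for this sketch. It is not the simulator used in the paper.

```python
def write_miss_fetches(write_addrs, policy, block_bytes=16, num_blocks=8):
    """Count block fetches caused by write misses in a direct-mapped cache."""
    policies = {"no_allocate", "allocate_no_subblock", "allocate_subblock"}
    if policy not in policies:
        raise ValueError(f"unknown policy: {policy}")
    tags = [None] * num_blocks          # which memory block each cache block holds
    fetches = 0
    for addr in write_addrs:
        block = addr // block_bytes
        index = block % num_blocks
        if tags[index] == block:
            continue                    # write hit: no fetch under any policy
        if policy == "no_allocate":
            continue                    # the write bypasses the cache entirely
        tags[index] = block             # allocate a cache block for the write
        if policy == "allocate_no_subblock":
            fetches += 1                # the rest of the block must be fetched
        # allocate_subblock: the other words are simply marked invalid, no fetch
    return fetches
```

For a run of 64 sequential word writes, as produced by initializing freshly allocated memory, only write allocate without subblock placement pays for fetches, one per block touched.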

A miss is a compulsory miss if it is due to a memory block being accessed for the first time.
A miss is a capacity miss if it results from the cache (size C) not being big enough to hold
all the memory blocks used by a program. This corresponds to the misses in a fully
associative cache of size C with LRU replacement policy (minus the compulsory misses). It is
a conflict miss if it results from two memory blocks mapping to the same set [19].

A write buffer may be used to reduce the cost of writes to main memory. A write buffer is a
queue containing writes that are to be sent to main memory. When the CPU does a write, the
write is placed in the write buffer and the CPU continues without waiting for the write to
finish. The write buffer retires entries to main memory using free memory cycles. A write
buffer stall occurs if the write buffer is full when the CPU tries to do a write or tries to
read a location queued up in the write buffer.
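A minimal sketch of the buffer-full case follows. The function name, the fixed retire time per write, and the default depth are assumptions made for illustration; stalls caused by reading a queued location, page-mode writes, and competing memory traffic are not modeled.

```python
from collections import deque


def write_buffer_stalls(write_cycles, depth=6, retire_cycles=5):
    """Return CPU stall cycles caused by a full write buffer.

    write_cycles: sorted cycle numbers at which the CPU would issue writes
    if it never stalled. retire_cycles is an assumed fixed cost per write.
    """
    pending = deque()        # completion times of queued writes, oldest first
    stall = 0
    last_done = 0            # time at which memory finishes the newest queued write
    for t in write_cycles:
        now = t + stall      # earlier stalls delay every later write
        while pending and pending[0] <= now:
            pending.popleft()            # retire writes that have completed
        if len(pending) == depth:
            wait = pending[0] - now      # wait for the oldest entry to retire
            stall += wait
            now += wait
            pending.popleft()
        start = max(now, last_done)      # memory handles one write at a time
        last_done = start + retire_cycles
        pending.append(last_done)
    return stall
```

Writes that are spaced further apart than the retire time never stall, whereas a burst longer than the buffer depth does, which is the behavior that matters for bunched initializing writes.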

Main memory is divided into DRAM pages. Page-mode writes reduce the latency of writes to
the same DRAM page when there are no intervening memory accesses to another DRAM page.

2.1.2  Memory  subsystem  performance 

This section describes two metrics for measuring the performance of memory subsystems. One
popular metric is the cache miss ratio. The cache miss ratio is the number of memory accesses
that miss divided by the total number of memory accesses. Since different kinds of memory
accesses usually have different miss costs, it is useful to have miss ratios for each kind of
access.

Cache miss ratios alone do not measure the impact of the memory subsystem on overall system
performance. A metric which better measures this is the contribution of the memory subsystem
to CPI (cycles per useful instruction^2). CPI is calculated for a program as number of CPU
cycles to complete a program / total number of useful instructions executed. It measures how
efficiently the CPU is being utilized. The contribution of the memory subsystem to CPI is
calculated as number of CPU cycles spent waiting for the memory subsystem / total number of
useful instructions executed. As an example, on a DECStation 5000/200, the lowest CPI
possible is 1, completing one instruction per cycle. If the CPI for a program is 1.50, and
the memory contribution to CPI is 0.3, 20% of the CPU cycles are spent waiting for the
memory subsystem (the rest may be due to other causes

^1 Recall subblock size is assumed to be 1 word.

^2 All instructions besides nops are considered to be useful. A nop (null operation)
instruction is a software-controlled pipeline stall.


% check for heap overflow
cmp alloc+12,top
branch-if-gt call-gc
% write the object
store tag,(alloc)
store ra,4(alloc)
store rd,8(alloc)
% save pointer to object
move alloc+4,result
% add 12 to alloc pointer
add alloc,12


Figure  1:  Pseudo-assembly  code  for  allocating  an  object 


such as nops, multi-cycle instructions like integer division, etc.). CPI is machine dependent
since it is calculated using actual penalties.
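The two CPI definitions and the 20% example can be checked with a short Python sketch. The function names and the cycle counts are illustrative assumptions; the counts are chosen only so that they reproduce a CPI of 1.50 and a memory contribution of 0.3.

```python
import math


def cpi(total_cycles, useful_instructions):
    """Cycles per useful instruction for a whole program run."""
    return total_cycles / useful_instructions


def memory_cpi(memory_wait_cycles, useful_instructions):
    """Contribution of the memory subsystem to CPI."""
    return memory_wait_cycles / useful_instructions


# Hypothetical run: 100 useful instructions, 150 cycles, 30 of them spent waiting on memory.
program_cpi = cpi(150, 100)
program_memory_cpi = memory_cpi(30, 100)
memory_share = program_memory_cpi / program_cpi

assert math.isclose(program_cpi, 1.5)
assert math.isclose(program_memory_cpi, 0.3)
assert math.isclose(memory_share, 0.2)   # 20% of all cycles
```

Dividing the memory contribution by the overall CPI gives the fraction of cycles lost to the memory subsystem, which is how the 20% figure arises.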

2.2  Copying  garbage  collection 

A copying garbage collector [17, 11] reclaims an area of memory by copying all the live
(non-garbage) data to another area of memory. This means that all data in the
garbage-collected area is now garbage, and the area can be re-used. Since memory is always
reclaimed in large contiguous areas, objects can be sequentially allocated from such areas at
the cost of only a few instructions. Figure 1 gives an example of pseudo-assembly code for
allocating a cons cell; ra contains the car cell contents, rd contains the cdr cell contents,
alloc is the address of the next free word in the allocation area, and top contains the end
of the allocation area.
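The allocation sequence of Figure 1 can be mirrored in Python as a rough sketch. The class, its dictionary-based memory, and the placeholder collect method are assumptions made for illustration; a real collector would copy live data out of the area before reusing it.

```python
class AllocationArea:
    """Sequential (bump-pointer) allocation of 3-word cons cells, 4 bytes per word."""

    def __init__(self, size_bytes):
        self.memory = {}          # byte address -> word, standing in for the heap
        self.alloc = 0            # address of the next free word
        self.top = size_bytes     # end of the allocation area
        self.collections = 0

    def collect(self):
        # Placeholder for call-gc: assumes nothing is live and reuses the whole area.
        self.memory.clear()
        self.alloc = 0
        self.collections += 1

    def cons(self, car, cdr, tag="cons"):
        if self.alloc + 12 > self.top:       # check for heap overflow
            self.collect()
        self.memory[self.alloc] = tag        # write the object
        self.memory[self.alloc + 4] = car
        self.memory[self.alloc + 8] = cdr
        result = self.alloc + 4              # save pointer to object
        self.alloc += 12                     # add 12 to alloc pointer
        return result
```

In a 24-byte area, two cells fit; the third allocation triggers the overflow path and starts again at the bottom of the area.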

The SML/NJ compiler uses a simple generational copying garbage collector [27]. Memory is
divided into an old generation and an allocation area. New objects are created in the
allocation area; garbage collection copies the live objects in the allocation area to the old
generation, freeing up the allocation area. Generational garbage collection relies on the
fact that most allocated objects die young; thus most objects (about 99% [3, p. 206]) are
not copied from the allocation area. This makes the garbage collector efficient, since it
works mostly on an area of memory where it is very effective at reclaiming space.
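The idea of copying only the live objects out of the allocation area can be sketched as follows. This is a toy model, not the SML/NJ collector: objects are named by strings, a field that is a string names another object, pointers from the old generation into the allocation area are ignored, and object names in the two areas are assumed to be disjoint.

```python
def minor_collection(allocation_area, roots, old_generation):
    """Copy objects reachable from roots into old_generation, then free the area.

    allocation_area, old_generation: dicts mapping object name -> list of fields.
    Returns the number of objects copied.
    """
    copied = 0
    worklist = [name for name in roots if name in allocation_area]
    while worklist:
        name = worklist.pop()
        if name in old_generation:
            continue                          # already copied
        fields = allocation_area[name]
        old_generation[name] = list(fields)   # copy the live object
        copied += 1
        for field in fields:
            if isinstance(field, str) and field in allocation_area:
                worklist.append(field)        # follow pointers to young objects
    allocation_area.clear()                   # everything left behind is garbage
    return copied
```

Only reachable objects cost any work; an unreachable object is reclaimed without ever being touched, which is why a low survival rate makes the collector cheap.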

The most important property of a copying collector with respect to memory subsystem behavior
is that allocation initializes memory which has not been touched in a long time and is thus
unlikely to be in the cache. This is especially true if the allocation area is large relative
to the size of the cache, since allocation will knock everything out of the cache. This means
that for small caches there will be a large number of (write) misses.

For example, consider the code in Figure 1. Assume that a cache write miss costs 16 CPU
cycles and that the block size is 4 words. On average, every fourth word allocated causes a
write miss. Thus, the average memory subsystem cost of allocating a word on the heap is 4
cycles. The average cost for allocating a cons cell is seven cycles (at one cycle per
instruction) plus 12 cycles for the memory subsystem overhead. Thus, while allocation is
cheap in terms of instruction counts, it is expensive in terms of machine cycle counts.
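The arithmetic in this example can be written out as a small Python sketch. The function name is hypothetical, and the write-miss penalty of 16 cycles is the value implied by the stated cost of 4 cycles per word with a 4-word block.

```python
def allocation_cost(words, instructions, miss_penalty_cycles, block_words):
    """Return (instruction cycles, memory cycles, total) for one allocation.

    Assumes one cycle per instruction and one write miss per cache block
    of freshly allocated memory.
    """
    per_word = miss_penalty_cycles / block_words   # average miss cost per word
    memory_cycles = words * per_word
    return instructions, memory_cycles, instructions + memory_cycles


# A 3-word cons cell allocated by the 7 instructions of Figure 1.
instr, mem, total = allocation_cost(words=3, instructions=7,
                                    miss_penalty_cycles=16, block_words=4)
assert (instr, mem, total) == (7, 12.0, 19.0)
```

Under these assumptions the memory subsystem overhead (12 cycles) exceeds the instruction cost (7 cycles), so the memory term dominates the cost of allocation.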
2.3  Standard  ML


Standard ML (SML) [29] is a call-by-value, lexically scoped language with higher-order
functions, garbage collection, static typing, a polymorphic type system, provable safety
properties, a sophisticated module system, and a dynamically scoped exception mechanism.

SML encourages a non-imperative programming style. Variables cannot be altered once they
are bound, and by default data structures cannot be altered once they are created. Lisp's
rplaca and rplacd do not exist for the default definition of lists in SML. The only kinds of
assignable data structures are ref cells and arrays^3, which must be explicitly declared. To
emphasize the point, assignments are permitted but discouraged as a general programming
style. The implications of this non-imperative programming style for compilation are clear:
SML programs tend to do more allocation and copying than programs written in imperative
languages.

SML is most closely related to Lisp and Scheme. Implementation techniques for one of these
languages are mostly applicable to the other languages, with the following caveats: SML
programs tend to be less imperative than Lisp or Scheme programs, and Scheme and SML programs
use function calls more frequently than Lisp, since recursion is the usual way to achieve
iteration in Scheme and SML.

2.4  SML/NJ  compiler 

The SML/NJ compiler [3] is a publicly available compiler for SML. We used version 0.91. The
compiler concentrates on making allocation cheap and function calls fast. Allocation is done
in-line, except for the allocation of arrays. Aggressive β-reduction (inlining) is used to
eliminate function calls and their associated overhead. Function arguments are passed in
registers when possible, and register targeting is used to minimize register shuffling at
function calls. A split caller/callee-save register convention is used to avoid excessive
spilling of registers. The compiler also does constant-folding, elimination of functions
which trivially call other functions, limited code hoisting, uncurrying, and instruction
scheduling.

The most controversial design decision in the compiler was to allocate procedure activation
records on the heap instead of the stack [1, 5]. In principle, the presence of higher-order
functions means that procedure activation records must be allocated on the heap. With a
suitable analysis, a stack can be used to store most activation records [21]. However, using
only a heap simplifies the compiler, the run-time system [2], and the implementation of
first-class continuations [18]. The decision to use only a heap was controversial because it
greatly increases the amount of heap allocation, which is believed to cause poor memory
subsystem performance.

3  Related  Work 

There have been many studies of the cache behavior of systems using heap allocation and some
form of copying garbage collection. Peng and Sohi [30] examined the data cache behavior of
some small Lisp programs. They used trace-driven simulation, and proposed an ALLOCATE
instruction for improving cache behavior, which allocates a block in the cache without
fetching it from memory. Wilson et al. [37, 38] argued that cache performance of programs
with generational garbage collection will improve substantially when the youngest generation
fits in the cache. Koopman et al. [23] studied the effect of cache organization on
combinator graph reduction, an implementation


^3 Although the language definition omitted arrays, all implementations have arrays.

technique for lazy functional programming languages. Combinator graph reduction does more
heap allocation and assignments than SML/NJ programs. They observed the importance of a
write-allocate policy with subblock placement for improving heap allocation. Zorn [39]
studied the impact of cache behavior on the performance of a Common Lisp system, when
stop-and-copy and mark-and-sweep garbage collection algorithms were used. He concluded that
programs run with mark-and-sweep have substantially better cache locality than when run with
stop-and-copy.

These works all used data cache miss ratios to evaluate cache performance. They did not
separate read and write misses, despite the different costs of these misses. Also, they did
not simulate the entire memory subsystem. Our work separates read misses from write misses
and completely models the memory subsystem, including write buffers and page-mode writes.

Appel [3] estimated CPI for the SML/NJ system on a single machine using elapsed time and
instruction counts. His CPI differs substantially from ours. Apparently instructions were
undercounted in his measurements [4].

Jouppi [21] studied the effect of cache write policies on the performance of C and Fortran
programs. Our class of programs is different from his, but his conclusions support ours: that
a write-allocate policy with subblock placement is a desirable architecture feature. He found
that the write miss ratio for the programs he studied was comparable to the read miss ratio,
and that write-allocate with subblock placement eliminated the cost of write misses. For
programs compiled with the SML/NJ compiler, this is even more important due to the high
number of write misses caused by allocation.

4  Methodology 

We used trace-driven simulations to evaluate the memory subsystem performance of programs.
For trace-driven simulations to be useful, there must be an accurate simulation model and a
good selection of benchmarks. Simulations that make simplifying assumptions about important
aspects of the system being modeled can yield misleading results. Toy benchmarks, or
unrepresentative benchmarks, can be equally misleading. We have devoted much effort to
addressing those issues.

4.1  Tools 

We have extended QPT [7, 25, 26] to produce memory traces for SML/NJ programs. QPT rewrites
an executable program to produce a full instruction and data trace. Because QPT operates on
the executable program, it can trace both the SML code and the garbage collector (written in
C).

We used Tycho [20] for the memory subsystem simulations. Tycho uses a special case of
all-associativity simulation [28] to simulate multiple caches concurrently. We have added a
write-buffer simulator to Tycho, which concurrently simulates a write buffer for each
instruction and data cache pair being simulated. The write-buffer simulator also takes
page-mode writes and memory refreshes into consideration.

4.2  Simplifications  and  Assumptions 

We wanted to simulate the memory subsystems as completely as we could. Thus, we have tried
to minimize simplifications which may reduce the validity of our data. The most important
simplifications are:

1.  We ignore the effects of context switches and system calls.

2.  Our simulations are driven by virtual addresses even though many current machines have
physically-addressed caches.

3.  We use default compilation flags which enable extensive optimizations. We set the soft
limit of the garbage collector to 20000K^4.

4.  When comparing different cache organizations we assume that the CPU cycle time is the
same.

4.3  Benchmarks 

Table 1 describes the benchmark programs^5. Knuth-Bendix, Lexgen, Life, Simple, VLIW, and
YACC are identical to the benchmarks measured by Appel [3]. Table 2 gives the sizes of the
benchmarks in terms of lines of SML code (excluding comments and blank lines), maximum heap
size in kilobytes, size of the compiled code in kilobytes (does not include the garbage
collector and other run-time support code, which is about 60K)^7, and run time, in seconds,
on a DECStation 5000/200. The run times are the minimum of five runs.

Table 3 characterizes the memory references of the benchmark programs. The Writes column
lists the number of full-word writes done by the program and the garbage collector; the
Assignments column lists the non-initializing writes done by the program only. The Partial
Writes column lists the number of partial-word (byte, half-word, etc.) writes done by the
program and the garbage collector^8. All the benchmarks have long traces; most other work on
memory system performance uses traces that are an order of magnitude smaller. The benchmark
programs do few assignments: the majority of the writes are initializing writes.

Table 4 gives the allocation statistics for each benchmark program^9. All allocation and
sizes are reported in words. The Allocation column lists the total allocation done by the
benchmark. The remaining columns break down the allocation by kind: closures for escaping
functions, closures for known functions, closures for callee-save continuations^10, records,
and others (includes spill records, arrays, strings, vectors, ref cells, store list records,
and floating point numbers). For each allocation kind, the % column is the percentage of
total allocation allocated for that kind of object and Size is the average size (including
the 1 word tag) for that kind of object.

4.4  Metrics 

We state cache performance numbers in cycles per useful instruction (CPI). All instructions
besides nops are considered useful.


^4 This is large enough to allow the garbage collector to resize the heap as needed.

^5 Available from the authors.

^6 The descriptions of these benchmarks have been copied from [3].

^7 The code size includes 207K for the standard libraries.

^8 Partial-word writes are distinguished from full-word writes since they are often more
expensive than full-word writes. We charge 11 cycles for each partial-word write.

^9 This table corrects one given in the POPL '94 paper, which did not include allocation data
for floating point numbers. Our thanks to Darko Stefanovic for bringing this to our
attention.

^10 Closures for callee-save continuations can be trivially allocated on a stack in the
absence of first-class continuations.


Program   Description

CW        The Concurrency Workbench [12] is a tool for analyzing networks of finite
          state processes expressed in Milner's Calculus of Communicating Systems.
Leroy     An implementation of the Knuth-Bendix completion algorithm.
Lexgen    A lexical-analyzer generator [6], processing the lexical description of
          Standard ML.
Life      The game of Life implemented using lists [32].
PIA       The Perspective Inversion Algorithm [36] decides the location of an object
          in a perspective video image.
Simple    A spherical fluid-dynamics program [13].
VLIW      A Very-Long-Instruction-Word instruction scheduler.
YACC      An implementation of an LALR(1) parser generator processing the gram-


Table 1: Benchmark Programs


                           Size                               Run time
Program        Lines   Heap size (K)   Code size (K)   Non-gc (sec)   Gc (sec)

CW              5728        1107            891            22.74         3.08
Knuth-Bendix     491        2768            251            13.47         1.18
Lexgen          1221        2162            305            15.07         1.06
Life             111        1026            221            16.97         0.19
PIA             1454        1025            291             6.07         0.34
Simple           999       11571            311            2?.58         4.23
VLIW            3207        1088            486            23.70         1.91
YACC            5751        1632            580             4.60         1.98

Table 2: Sizes of Benchmark Programs


Program        Inst Fetches   Reads (%)   Writes (%)   Partial Writes (%)   Assignments (%)   Nops (%)

CW             523,245,987      17.61       11.61            0.01                 ?             13.2?
Knuth-Bendix   312,086,438      19.66       22.31            0.00                0.00            5.92
Lexgen         328,422,283      16.08       10.?4            0.20                0.21           12.33
Life           413,536,662      12.18        9.26            0.00                0.00           15.45
PIA            122,215,151      25.27       16.50            0.00                0.00            8.39
Simple         604,611,016      23.86       14.06            0.00                0.05            7.38
VLIW           399,812,033      17.89       15.99            0.10                0.77             ?
YACC           133,043,324      18.49       14.66            0.32                0.38           11.11

Table 3: Characteristics of benchmark programs
               Allocation     Escaping       Known          Callee Saved   Records        Other
Program        (words)        %     Size     %     Size     %     Size     %     Size     %     Size

CW             56,?67,?40     4.0   4.12     3.3   15.39    67.2  6.20     19.5  3.0?     6.0   4.00
Knuth-Bendix   67,733,930     37.6  6.60     0.1   15.22    49.5  4.90     12.7  3.00     0.1   5.05
Lexgen         33,046,3?9     3.4   6.20     5.4   12.96    72.7  6.40     15.1  3.00     3.7   6.97
Life           37,840,681     0.2   3.45     0.0   15.00    77.8  5.52     22.2  3.00     0.1   10.29
PIA            18,841,256     0.4   5.56     28.0  11.99    25.0  4.69     12.7  3.11     33.9  3.22
Simple         80,761,644     4.0   5.70     1.1   15.33    68.1  6.13     8.3   3.0?     18.5  3.11
VLIW           59,497,132     9.9   5.22     6.0   26.62    61.8  7.67     20.3  3.01     2.0   2.60
YACC           17,0?5,250     2.3   4.83     15.3  15.35    54.8  7.14     23.7  3.01     4.0   10.22

Table 4: Allocation characteristics of benchmark programs

Table 5 lists the penalties used in the simulations. These numbers are derived from the
penalties for the DECStation 5000/200, but are similar to those in other machines of the same
class. Note that write misses have no penalty (besides write buffer costs) for caches with
subblock placement^11.


5  Results  and  Analysis 

Section 5.1 qualitatively analyzes the memory behavior of programs. Section 5.2 lists the
cache configurations simulated and explains why they were selected. Section 5.3 presents and
analyzes data for memory subsystem performance.

5.1  Qualitative  Analysis 

Recall  from  Section  2  that  SML/NJ  uses  a  copying  collector  which  leads  to  a  large  number  of  write 
misses.  The  slowdown  this  translates  into  depends  on  the  cache  organization  being  used. 

Recall from Section 4.3 that SML/NJ programs have the following properties. First, they do
few assignments; the majority of the writes are initializing writes. Second, programs do heap
allocation at a furious rate: 0.1 to 0.22 words per instruction. Third, writes come in
bunches because they correspond to initialization of a newly allocated area.
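The allocation rate can be recomputed from the benchmark tables. The sketch below divides the Allocation column of Table 4 by the Inst Fetches column of Table 3 for two benchmarks whose figures are legible; treating instruction fetches as the instruction count is an assumption of this sketch, as are the variable names.

```python
# (words allocated, instruction fetches), copied from Tables 4 and 3.
benchmarks = {
    "Knuth-Bendix": (67_733_930, 312_086_438),
    "Simple": (80_761_644, 604_611_016),
}

# Words allocated per instruction fetched, per benchmark.
rates = {name: words / fetches for name, (words, fetches) in benchmarks.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f} words per instruction")
```

Knuth-Bendix comes out at about 0.22 words per instruction, the upper end of the quoted range, and Simple at about 0.13.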

The burstiness of writes combined with the property of copying collectors mentioned above
suggests that an aggressive write policy is necessary. In particular, writes should not stall
the CPU. Memory subsystem organizations where the CPU has to wait for a write to be written
to memory will perform poorly. Even memory subsystems where the CPU does not need to wait for
writes if they are issued far apart (e.g., 2 cycles apart in the HP 9000 series 700) may
perform poorly due to the bunching of writes. This leads to two requirements on the memory
subsystem. First, a write buffer or fast page-mode writes are essential to avoid waiting for
writes to memory. Second, on a write miss, the memory subsystem must avoid reading a cache
block from memory if it will be written before being read. Of course, this requirement only
holds for caches with a write-allocate policy. Subblock placement [23], a block size of 1
word, and the ALLOCATE directive [30] can all achieve this^12. For large caches, when the
allocation area fits in the cache and thus there


¹¹In an actual implementation, the penalty of a miss may be one cycle since, unlike hits, the tag and valid bits
need to be written to the cache after the miss is detected. This will not change our results since it adds at most
0.02-0.05 to the CPI of caches with subblock placement.

¹²Since the effects on cache performance of these features are so similar, we talk just about subblock placement

Task    Penalty (in cycles)


Non page mode write
Page mode write
Read 16 bytes from memory
Read 32 bytes from memory
Write hit or miss (subblocks)

Write hit (16 bytes, no subblocks)
Write hit (32 bytes, no subblocks)
Write miss (16 bytes, no subblocks)
Write miss (32 bytes, no subblocks)


^  r " 
16  " 

iT" 

.. 

I) 

1 1 

hV  " 


Table 5: Penalties of memory operations


Write Policy  Write Miss Policy  Write Buffer  Subblocks  Assoc  Block Size    Cache Sizes
through       allocate           6 deep        yes        1, 2   16, 32 bytes  8K-128K
through       allocate           6 deep        no         1, 2   16, 32 bytes  8K-128K
through       no allocate        6 deep        no         1, 2   16, 32 bytes  8K-128K

Table 6: Cache organizations studied

are few write misses, the benefit of subblock placement will be reduced.
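
The second requirement can be illustrated with a small model. The following is a minimal sketch, not the simulator used in this study: it assumes a direct-mapped, write-allocate cache, one valid bit per word when subblock placement is enabled, and a workload that initializes a freshly allocated region sequentially and then reads it back. The function name `count_block_fetches` and its parameters are hypothetical.

```python
def count_block_fetches(num_words, cache_bytes, block_bytes, subblocks, word_bytes=4):
    """Count block fetches from memory for sequential writes followed by reads.

    Assumed model: direct-mapped, write-allocate cache. With subblock
    placement a write miss claims the block without reading memory; without
    it, every write miss fetches the whole block first.
    """
    num_blocks = cache_bytes // block_bytes
    words_per_block = block_bytes // word_bytes
    tags = [None] * num_blocks
    valid = [[False] * words_per_block for _ in range(num_blocks)]
    fetches = 0

    addresses = [i * word_bytes for i in range(num_words)]
    # Initializing writes first, then one read of every word.
    trace = [(a, True) for a in addresses] + [(a, False) for a in addresses]

    for addr, is_write in trace:
        block = addr // block_bytes
        index = block % num_blocks
        word = (addr % block_bytes) // word_bytes
        if tags[index] != block:
            tags[index] = block
            if is_write and subblocks:
                # Claim the block; no word is valid until it is written.
                valid[index] = [False] * words_per_block
            else:
                fetches += 1
                valid[index] = [True] * words_per_block
        elif not is_write and not valid[index][word]:
            # Tag matches but the word was never written or fetched.
            fetches += 1
            valid[index] = [True] * words_per_block
        if is_write:
            valid[index][word] = True
    return fetches


if __name__ == "__main__":
    for subblocks in (False, True):
        n = count_block_fetches(1024, 8 * 1024, 16, subblocks)
        print("subblocks" if subblocks else "no subblocks", n)
```

In this model, when the region fits in the cache, subblock placement removes all fetches caused by initializing writes; when it does not fit, only the read misses remain.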

5.2  Cache  configurations  simulated 

Since the design space for memory subsystems is enormous, we had to prune the design space that
we could study. In this study, we restrict ourselves to features found in currently popular RISC
workstations. Exploration of more exotic memory subsystem features is left to future work. Table
6 summarizes the cache organizations simulated. Table 7 lists the memory subsystem organization
for some popular machines.

We simulated only separate instruction and data caches (i.e., no unified caches). While many
current machines have separate caches (e.g., DECStations, HP 700 series), there are some exceptions
(notably SPARCs).

We simulated cache sizes from 8K to 128K. This range includes the primary caches of most
current machines (see Table 7). We consider only direct mapped and two-way set associative caches
(with LRU replacement).

We simulated block sizes of 16 bytes and 32 bytes. Przybylski [31] notes that block sizes of 16
or 32 bytes optimize the read access time for the memory parameters used in the CPI calculations
(see Section 4.4).

We report data only for write-through caches, but the CPI for write-back caches can be inferred
from our graphs. Write-through and write-back caches give identical misses, but the penalties for
write hits and write misses differ. A write hit or miss in a write-back cache may take one cycle
more than in a write-through cache [21]. This tells us at most how much the write-through graphs
need to be shifted to obtain the CPI graphs for write-back caches. For instance, if the program
has w writes and n useful instructions, then we must add w/n to the CPI. For CW this adds 0.13.
Write-through and write-back caches may have different write buffer penalties. We expect the write
buffer penalties for write-back caches to be smaller than that for write-through caches since writes




Architecture          Write Policy  Write Miss Policy  Write Buffer  Subblocks  Assoc  Block Size  Cache Size
DS3100 [16]           through       allocate           4 deep        -          1      4 bytes     64K
DS5000/200 [15]       through       allocate           6 deep        yes        1      16 bytes    64K
HP 9000 [34]          back          allocate           none          no         1      32 bytes    64K-2M
SPARCStation II [14]  through       no allocate        4 deep        no         1      32 bytes    64K


Note: 

•  SPARCStations  have  unified  caches. 

•  Most HP 9000 series 700 caches are much smaller than 2M: 128K instruction cache and 256K data cache for models 720
and 730, and 256K instruction cache and 256K data cache for model 750.

•  The DS5000/200 actually has a block size of four bytes with a fetch size of sixteen bytes. This is actually stronger than
subblock placement since it has a full tag on every "subblock".

•  The higher end HP 9000 machines (model 735 and ...) ...

The hint can specify that a block will be overwritten before being read. ...

Table 7: Memory subsystem organization of some popular machines


to main memory are less frequent for write-back caches than for write-through caches. In any case,
write buffer penalties are negligible even for write-through caches (see Section 5.3.5).
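
The write-back adjustment described above is a one-line calculation. The sketch below is only an illustration of that bound; the function name and the example figures (other than the 0.13 shift reported for CW) are assumptions.

```python
def write_back_cpi_upper_bound(write_through_cpi, writes, useful_instructions,
                               extra_cycles_per_write=1):
    """Upper bound on write-back CPI, given the write-through CPI.

    Assumes every write (hit or miss) costs at most one extra cycle in a
    write-back cache, so the CPI curve shifts up by at most w/n.
    """
    return write_through_cpi + extra_cycles_per_write * writes / useful_instructions


if __name__ == "__main__":
    # Hypothetical program with 13 writes per 100 useful instructions,
    # which matches the 0.13 shift quoted for CW.
    print(write_back_cpi_upper_bound(1.30, writes=13, useful_instructions=100))
```

Because this is an upper bound, the true write-back curve lies between the write-through curve and the shifted one.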

Two of the most important cache parameters are write allocate versus write no allocate and sub-
block placement versus no subblock placement. Of these, the combination write no allocate/subblock
placement offers no improvement over write no allocate/no subblock placement for cache perfor-
mance. Thus, we did not collect data for the write no allocate/subblock placement configuration.

We restrict ourselves only to the first two levels of the memory hierarchy, which on current
machines corresponds to the primary cache and main memory. The results, however, are also
applicable when the second level is a secondary cache and the cost of accessing the secondary cache
is similar to the cost of accessing main memory on the DECStation 5000/200¹³. In such machines,
there is a memory subsystem contribution to the CPI that we did not measure: a miss on the second
level cache. Therefore the CPI obtained on these machines can be higher than that reported here.

We do not simulate the exotic features appearing on some newer machines, such as stream
buffers, prefetching, and victim caches. These features can reduce the cache miss rates and miss
costs. Further work is needed to understand the impact of these features on performance of heap
allocation.

5.3  Memory  Subsystem  Performance 

Memory subsystem performance is presented in summary graphs and breakdown graphs. Each
summary graph summarizes the memory subsystem performance of one benchmark program for
a range of write-miss policies (write allocate or write no allocate), subblock placement (with or
without), cache sizes (8K to 128K), and associativity (1 or 2). Each curve in a summary graph
corresponds to a different memory subsystem organization. There are two summary graphs for
each program, one for a block size of 16 bytes and another for a block size of 32 bytes. Each
breakdown graph breaks down the memory subsystem overhead into read misses, instruction fetch
misses, write-buffer overhead, and partial-word write overhead for one configuration in a summary
graph. The write-buffer depth in these graphs is fixed at 6 entries.


¹³For instance, Borg et al. [8] use 12 cycles as the latency for going to the second level cache and ... cycles
for going to memory.

In this paper we present only the summary graphs for CW (Figure 2). The summary graphs
for other programs are similar to those for CW and are thus omitted for space considerations. Any
significant differences between CW's graphs and the omitted graphs are noted in the text. Figures
3 and 4 are the breakdown graphs for CW for the 16 byte block size configurations; the remaining
breakdown graphs for CW are omitted for space considerations. The breakdown graphs for the other
benchmarks are similar and are thus also omitted for space considerations.

In the summary graphs, the nops curve is the base CPI: the total number of instructions
executed divided by the number of useful (not nop) instructions executed; this corresponds to the
CPI for a perfect memory subsystem¹⁵. For the breakdown graphs, the nop area is the CPI con-
tribution of nops; read miss is the CPI contribution of read misses; if miss is the CPI contribution
of instruction fetch misses; write buffer is the CPI contribution of the write buffer; partial word is
the CPI contribution of partial-word writes¹⁶.
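
As a worked illustration of the base CPI, the sketch below assumes that CPI is measured in cycles per useful instruction and that, with a perfect memory subsystem, every instruction (nops included) takes one cycle. Under those assumptions the base CPI follows directly from the fraction of executed instructions that are nops; the function name is hypothetical.

```python
def base_cpi(nop_fraction):
    """Cycles per useful instruction with a perfect memory subsystem.

    Assumes one cycle per executed instruction, so the base CPI is
    total instructions / useful instructions = 1 / (1 - nop_fraction).
    """
    if not 0.0 <= nop_fraction < 1.0:
        raise ValueError("nop_fraction must be in [0, 1)")
    return 1.0 / (1.0 - nop_fraction)


if __name__ == "__main__":
    # Nop fractions at the two ends of the range reported for the benchmarks.
    for fraction in (0.059, 0.154):
        print(fraction, round(base_cpi(fraction), 3))
```

The memory subsystem contributions in the breakdown graphs are stacked on top of this base value.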

The 64K point on the write alloc, subblock, assoc=1 curves corresponds closely to the DECSta-
tion 5000/200 memory subsystem.

In Sections 5.3.1, 5.3.2, 5.3.3, and 5.3.4 we describe the impact of write-miss policy and subblock
placement, associativity, block size, and cache size on the memory subsystem performance of the
benchmark programs. In Section 5.3.5 we give the write buffer and partial-word write overheads.

5.3.1  Write  Miss  Policy  and  Subblock  Placement 

From the summary graphs, it is clear that the best cache organization we studied is write allo-
cate/subblock placement; in every case, write allocate/subblock placement substantially outperforms
all other configurations. Surprisingly, for sufficiently large caches with the write allocate/subblock
placement organization, the memory subsystem performance of SML/NJ programs is acceptable
(around 17% or less overhead)¹⁷. For caches with write allocate/subblock placement, the average
memory subsystem contribution to the CPI over all benchmarks is 16% for 64K direct mapped
caches and 17% for 32K two-way associative caches. The DS5000/200 organization does well for
most programs. It is worth emphasizing that the memory subsystem performance of SML/NJ
programs is good on some current machines despite the very high miss rates: for a 64K write allo-
cate/no subblock placement organization with a block size of 16 bytes, the write miss and read miss
ratios for CW are 0.18 and 0.04 respectively.

Recall that in Section 5.1 we argued that subblock placement would be a big win, but its
benefits would decrease for larger caches. Our data indicates that the reduction in benefits is not
substantial even for 128K cache sizes, although a slight tapering off is seen in CW. This indicates
that 128K is not large enough to hold the allocation area of most of the benchmark programs.

¹⁴Lexgen's graphs are a little different in that there is a steep drop in the instruction cache contribution to the
CPI in going from an 8K to 16K cache.

¹⁵Nops constitute between 5.9% and 15.4% of all instructions executed for the benchmarks (see Section 4.3).

¹⁶This overhead is so small that it is not visible in most of the breakdown graphs.

¹⁷For the penalties used, a 17% overhead translates roughly into one fetch from memory (instruction or data)
every 100 useful instructions.

The performance of write allocate/no subblock is almost identical to that of write no allocate/no
subblock (Leroy is an exception). This suggests that an address is being read soon after being
written; even in an 8K cache, an address is read after being written before it is evicted from the
cache (if it was evicted from the cache before being read, then write allocate/no subblock would
have inferior performance). The only difference between these two schemes is when a cache block
is read from memory. In one case, it is brought in on a write miss; in the other, it is brought in
on a read miss. Because SML/NJ programs allocate sequentially and do few assignments, a newly
allocated object remains in the cache until the program has allocated another C bytes, where C is
the size of the cache. Since our programs allocate 0.4-0.9 bytes per instruction, our results suggest
that a read of a block occurs within 9K-20K instructions of it being written.
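
The 9K-20K figure follows from dividing the cache size by the allocation rate. The sketch below reproduces that arithmetic for the smallest cache simulated (8K); the function name is hypothetical, and the model assumes purely sequential allocation with no other data competing for the cache.

```python
def instructions_until_eviction(cache_bytes, alloc_bytes_per_instruction):
    """Instructions executed before sequential allocation wraps the cache.

    Assumes a newly allocated object is evicted only once another
    cache_bytes of data has been allocated after it.
    """
    return cache_bytes / alloc_bytes_per_instruction


if __name__ == "__main__":
    cache_bytes = 8 * 1024
    # Allocation rates at the two ends of the range quoted above.
    for rate in (0.9, 0.4):
        window = instructions_until_eviction(cache_bytes, rate)
        print(f"{rate} bytes/instruction: {window:.0f} instructions")
```

The same calculation for larger caches gives proportionally longer windows, so the 8K case is the tightest bound on how soon after a write the read must occur.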

5.3.2  Changing  Associativity 

From Figure 2 we see that increasing associativity improves all organizations. However, the improve-
ment in going from one-way to two-way set associativity is much smaller than the improvement
obtained from subblock placement; in most cases, it improves the CPI by less than 0.1. The
maximum benefit from higher associativity is obtained for small cache sizes (less than 16K). How-
ever, increasing associativity may increase CPU cycle time and thus the improvements may not be
realized in practice [19].

From Figures 3 and 4 we see that higher associativity improves the instruction cache perfor-
mance but has little or no impact on data cache performance. The improvement observed in going
to a two-way associative cache suggests that a lot of the penalty from the instruction cache is due
to conflict misses and that from the data cache is due to capacity misses; the data cache is simply
not big enough to hold the working set. When the code produced by SML/NJ is examined, the
performance of the instruction cache is not surprising: the code consists of small functions with
frequent calls, which lower the spatial locality. Thus, the chances of conflicts are greater than if
the instructions had strong spatial locality.

Surprisingly, for direct mapped caches (Figures 3 (a) and 4 (a)) the instruction cache penalty
is substantial for caches smaller than 128K. For caches with subblock placement, the instruction
cache penalty dominates the penalty for the memory subsystem. The instruction cache penalty
is reduced by the two-way associative cache organizations, suggesting a large number of conflict
misses in the instruction cache.
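
To illustrate how two-way associativity removes conflict misses, the sketch below counts misses in a set-associative cache with LRU replacement. It is a toy model, not the simulator used here; the trace (two small code fragments whose addresses differ by exactly the cache size, executed alternately) is a hypothetical worst case for a direct mapped cache.

```python
from collections import OrderedDict


def count_misses(trace, cache_bytes, block_bytes, assoc):
    """Count misses for an address trace in a set-associative LRU cache."""
    num_sets = cache_bytes // (block_bytes * assoc)
    # Each set maps block number -> True, ordered from least to most recent.
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in trace:
        block = addr // block_bytes
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)       # mark as most recently used
        else:
            misses += 1
            if len(lines) == assoc:
                lines.popitem(last=False)  # evict least recently used
            lines[block] = True
    return misses


if __name__ == "__main__":
    cache_bytes, block_bytes = 8 * 1024, 16
    # Caller and callee map to the same set of a direct mapped 8K cache.
    trace = [0, cache_bytes] * 100
    print("direct mapped:", count_misses(trace, cache_bytes, block_bytes, 1))
    print("two-way:", count_misses(trace, cache_bytes, block_bytes, 2))
```

In the direct mapped case every access evicts the other fragment, while the two-way cache takes only the two compulsory misses.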

5.3.3  Changing  Block  Size 

From Figure 2 we see that increasing block size from 16 to 32 bytes also improves performance.
For the write allocate organizations, an increased block size decreases the number of write misses
caused by allocation. When the allocation area does not fit in the cache, doubling the block size can
halve the write-miss rate. Thus, larger block sizes improve performance when there is a penalty
for a write miss [23]. In particular, larger block sizes have little to offer to caches with write
allocate/subblock placement. From Figure 2 we see that the write no allocate organizations benefit
just as much from larger block size as write allocate/no subblock placement; this suggests that the
spatial locality in the reads is comparable to that in the writes.
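
The halving of the write-miss rate can be seen in a simple model of sequential allocation. The sketch below assumes a direct mapped, write-allocate cache without subblock placement and a stream of word-sized initializing writes over a region larger than the cache; the function name and parameters are hypothetical.

```python
def sequential_write_miss_rate(region_bytes, cache_bytes, block_bytes, word_bytes=4):
    """Write-miss rate for word-by-word initialization of a fresh region.

    Assumed model: direct mapped, write-allocate, no subblock placement.
    Only the first write to each block misses.
    """
    num_blocks = cache_bytes // block_bytes
    tags = [None] * num_blocks
    misses = 0
    writes = 0
    for addr in range(0, region_bytes, word_bytes):
        block = addr // block_bytes
        index = block % num_blocks
        writes += 1
        if tags[index] != block:
            tags[index] = block
            misses += 1
    return misses / writes


if __name__ == "__main__":
    for block_bytes in (16, 32):
        rate = sequential_write_miss_rate(64 * 1024, 8 * 1024, block_bytes)
        print(f"block size {block_bytes}: write-miss rate {rate}")
```

In this idealized model the miss rate is exactly the word size divided by the block size, so it drops from one miss in four writes to one in eight when the block size doubles.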

Note that subblock placement improves performance more than even two-way associativity and
32 byte blocks combined.

5.3.4  Changing  Cache  Size 

Increasing the cache size improves performance for all configurations. In most cases, the perfor-
mance improvement from doubling the cache size is small. We expect to see a sharp improvement
in performance for some larger cache size (perhaps 256K or bigger) once the allocation area fits
in the cache (this will not be nearly as significant for caches with subblock placement). From the
breakdown graphs we see that the cache size has little effect on the data cache miss contribution
to CPI. Most of the improvement in CPI that comes from increasing the cache size is due to
improved performance of the instruction cache. As with associativity, cache sizes have interactions
with the cycle time of the CPU: larger caches can take longer to access. Thus, improvement due
to increasing the cache size may not be achieved in practice.

5.3.5  Write  Buffer  and  Partial- Word  Write  Overheads 

From the breakdown graphs we see that the write buffer and partial-word write contribution to the
CPI is negligible. A six deep write buffer coupled with page-mode writes is sufficient to absorb the
bursty writes. As expected, memory subsystem features which reduce the number of misses (such
as higher associativity and larger cache sizes) also reduce the write buffer overhead.
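
The way a write buffer absorbs a burst can be sketched with a small FIFO model. This is an illustrative model only: the function name, the one-write-at-a-time retirement rule, and the retirement costs used in the example are assumptions, not the penalties used in the simulations.

```python
from collections import deque


def write_buffer_stalls(issue_cycles, depth, retire_cycles):
    """CPU stall cycles caused by a FIFO write buffer.

    issue_cycles: cycles (ignoring stalls) at which the CPU issues writes.
    depth: number of buffer entries.
    retire_cycles: assumed cycles memory needs to retire one write.
    """
    pending = deque()  # completion times of buffered writes, oldest first
    last_done = 0
    stalls = 0
    for issue in issue_cycles:
        now = issue + stalls
        # Drop entries that memory has already retired.
        while pending and pending[0] <= now:
            pending.popleft()
        if len(pending) == depth:
            # Buffer full: wait for the oldest write to retire.
            stalls += pending[0] - now
            now = pending.popleft()
        last_done = max(now, last_done) + retire_cycles
        pending.append(last_done)
    return stalls


if __name__ == "__main__":
    burst = list(range(8))  # one write per cycle for eight cycles
    print("fast retirement:", write_buffer_stalls(burst, depth=6, retire_cycles=1))
    print("slow retirement:", write_buffer_stalls(burst, depth=6, retire_cycles=5))
```

With one-cycle retirement the buffer never fills, which is consistent with the observation that page-mode writes plus a six deep buffer absorb bursty initializing writes.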




[Figure 2, panels (a) block size=16 bytes and (b) block size=32 bytes. Vertical axis: cycles per useful
instruction; horizontal axis: I and D cache size. Curves: write-no-alloc, no-subblk; write-alloc, subblk;
write-alloc, no-subblk; each for assoc=1 and assoc=2.]

Figure 2: CW summary, write buffer depth=6

6  Conclusions


We described an in-depth study of the memory subsystem performance of programs compiled with
SML/NJ. The important characteristics of these programs, with respect to memory subsystem
performance, were intensive heap allocation and the use of copying garbage collection.

In agreement with [30, 37, 38, 39], programs with intensive heap allocation performed poorly
on most memory subsystem organizations. However, on some current machines (in particular the
DECStation 5000/200), the performance was good.

The memory organization parameter crucial for good performance was subblock placement. For
caches with subblock placement, the memory subsystem overhead was under 17% for 64K or bigger
caches; for caches without subblock placement, the overhead was often as high as 100%.

While associativity, cache sizes, and block sizes affected performance, their contribution to
performance was usually small. Associativity and cache sizes had little impact on data cache
performance, but were more important for instruction cache performance.

To summarize, most current machines support heap allocation poorly. For these machines,
compilers should avoid heap allocation as much as possible. However, with the appropriate memory
subsystem organization, heap allocation can achieve good memory subsystem performance.